In 2020, the 59th United States presidential election was held. It took place on November 3; Joe Biden ran as the Democratic candidate, while Donald Trump ran for re-election on the Republican side.
The 2020 United States presidential election was accompanied by exceptional economic and social events that could have had a significant impact on voters' emotions. These include the COVID-19 pandemic, racial unrest following the murder of George Floyd, and numerous protests. Moreover, the presidential debates covered economic plans, in particular tax policy, as well as other socially sensitive issues such as environmental policy in the light of ongoing climate change and the health-care program known as Obamacare.
Many of the events that stirred emotions in 2020 had a political backdrop, and the division of voters into two main parties, Democrats and Republicans, only exacerbated the conflicts.
Is it possible to evaluate voters' sentiment through their posts on the Twitter platform? Can the two camps behind Joe Biden and Donald Trump be distinguished by the emotions expressed on social media? Does either group of voters express clearly more extreme emotions?
For this purpose, a data set containing tens of thousands of Twitter posts with the hashtag #JoeBiden or #DonaldTrump was analyzed.
import pandas as pd
import numpy as np
%matplotlib inline
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
analyser = SentimentIntensityAnalyzer()
from wordcloud import WordCloud,STOPWORDS
import matplotlib.pyplot as plt
import plotly.express as px
First, the data structure was inspected, and redundant columns and N/A values were removed.
data_biden = pd.read_csv('./Election data/hashtag_joebiden.csv', lineterminator='\n', parse_dates=True)
data_biden.head()
data_biden = data_biden.dropna()
data_biden = data_biden[['created_at', 'tweet']]
data_biden.rename(columns={'created_at': 'Timestamp', 'tweet': 'Text'}, inplace=True)
data_biden.to_csv("./Election data/data_joebiden.csv", index=None)
txtm_biden = pd.read_csv('./Election data/data_joebiden.csv')
txtm_biden['Text'] = txtm_biden['Text'].astype(str)
print(txtm_biden.shape)
txtm_biden.head()
From the NLTK package, we will use functions to calculate sentiment scores. The VADER (Valence Aware Dictionary and sEntiment Reasoner) method was chosen.
%%time
compval1 = []
for i in range(len(txtm_biden)):
    # 'compound' is VADER's normalized overall sentiment score in [-1, 1]
    k = analyser.polarity_scores(txtm_biden.iloc[i]['Text'])
    compval1.append(k['compound'])
compval1 = np.array(compval1)
# The result of the operation is added to our data set
txtm_biden['VADER score'] = compval1
The VADER compound score is interpreted as follows: all values at or below 0 are considered negative, and the Author decides to treat those of 0.7 and above as strongly positive; scores in between are labelled neutral.
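These cutoffs can be collected into a small helper, a sketch mirroring the levels used in the loop below (the function name is illustrative):

```python
def label_sentiment(compound):
    """Map a VADER compound score to the labels used in this analysis."""
    if compound >= 0.7:
        return 'positive'
    elif compound > 0:
        return 'neutral'
    return 'negative'  # compound <= 0

# the same labelling could also be applied column-wise, e.g.:
# txtm_biden['predicted sentiment'] = txtm_biden['VADER score'].apply(label_sentiment)
print(label_sentiment(0.85), label_sentiment(0.3), label_sentiment(-0.2))
# → positive neutral negative
```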
%%time
predicted_value = []  # empty list to hold our predicted labels
for i in range(len(txtm_biden)):
    score = txtm_biden.iloc[i]['VADER score']
    if score >= 0.7:
        predicted_value.append('positive')
    elif score > 0:
        predicted_value.append('neutral')
    else:  # score <= 0
        predicted_value.append('negative')
# Add predicted values to our data set and separate days from the 'Timestamp' column
txtm_biden['predicted sentiment'] = predicted_value
txtm_biden['Date'] = txtm_biden['Timestamp'].apply(lambda s: s.split()[0])
# We save our data set with evaluated scores for later
txtm_biden.to_csv("./Election data/data_joebiden_sent.csv", index=None)
txtm_biden_sentiment = pd.read_csv('./Election data/data_joebiden_sent.csv')
txtm_biden_sentiment.head(5)
We will use the VADER scores and the time-structured data to check whether the sentiment of the tweets changes over time.
Public opinion is shaped by the information that reaches it, and that information can be manipulated (Max Weber, 1988). Such activity may intensify during an election campaign, when political parties become more active and their actions attract increased media interest.
As public opinion does not exist per se, political communication is possible (Eric Maigret, 2012).
Considering the above, politicians use various methods of communication and opinion transmission, which may carry a certain level of sentiment, the subject examined in this work.
How sensitive is the content to the upcoming elections, and what is the share of tweets classified as positive compared to negative before and after the November 3 election? The chart also marks November 6, by which time the so-called 'Swing States' were already recognizable.
txtm_biden_sentiment.groupby('predicted sentiment').size().plot(kind='bar')
#data preparation
txtm_biden_temp = txtm_biden_sentiment.groupby(['Date','predicted sentiment']).count().reset_index()
txtm_biden_temp = txtm_biden_temp[['Date', 'predicted sentiment', 'Timestamp']]
txtm_biden_temp.rename(columns={'Date': 'Date', 'predicted sentiment': 'Sentiment','Timestamp': 'Count'}, inplace=True)
txtm_biden_temp = np.array(txtm_biden_temp)
txtm_biden_temp = pd.DataFrame(txtm_biden_temp[3:78])
txtm_biden_temp.rename(columns={0: 'Date', 1: 'Sentiment',2: 'Count'}, inplace=True)
txtm_biden_temp.head(10)
# a cross-tabulated data structure is needed for this type of plot
p = np.array(txtm_biden_temp['Date'])
q = np.array(txtm_biden_temp['Sentiment'])
r = np.array(txtm_biden_temp['Count'])
crossed_data = pd.crosstab(p, columns=q, values=r, aggfunc='sum', rownames=['Date'], colnames=['Sentiment'])
crossed_data['sum'] = crossed_data['negative']+crossed_data['neutral']+crossed_data['positive']
crossed_data['pos_per'] = crossed_data['positive']/crossed_data['sum']
crossed_data['neg_per'] = crossed_data['negative']/crossed_data['sum']
crossed_data['neu_per'] = crossed_data['neutral']/crossed_data['sum']
crossed_data['date'] = crossed_data.index
crossed_data = np.array(crossed_data)
crossed_data = pd.DataFrame(crossed_data)
Basic indicators per day have been calculated and are shown below:
crossed_data.rename(columns={0: 'Negative', 1: 'Neutral',2: 'Positive',3: 'Sum', 4: 'Posper',5: 'Negper',6: 'Neuper',7:'Date'}, inplace=True)
crossed_data.head(5)
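The same per-day shares can also be computed without the intermediate NumPy round-trips; a sketch on a hypothetical miniature of the data (column names follow the ones used above):

```python
import pandas as pd

# hypothetical miniature of txtm_biden_sentiment
mini = pd.DataFrame({
    'Date': ['2020-11-03', '2020-11-03', '2020-11-03', '2020-11-04'],
    'predicted sentiment': ['negative', 'neutral', 'positive', 'negative'],
})

# count tweets per day and sentiment, then normalize each day to shares
counts = mini.groupby(['Date', 'predicted sentiment']).size().unstack(fill_value=0)
shares = counts.div(counts.sum(axis=1), axis=0)
print(shares)
```

`unstack(fill_value=0)` pivots the sentiment level into columns, so days with no tweets of a given label get a 0 count instead of a NaN.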
df = crossed_data[['Date','Posper','Negper','Neuper']]
ax = df.plot.area(x='Date', y=['Posper','Negper','Neuper'], colormap='winter', figsize=(10,7))
ax.axvline(x=22, color='r', linestyle='--', lw=3, label="6th November - 'Swing States' visible")
ax.axvline(x=19, color='y', linestyle='--', lw=3, label='3rd November - US Election Day')
ax.set_xlabel('')
ax.set_ylabel('Predicted Sentiment Share')
ax.legend(bbox_to_anchor=(1.0, 1), loc='upper left')
From the chart, you can observe some increase in negative tweets, perhaps as a result of signs of a potential loss for Joe Biden.
On the other hand, on November 6, when it was already possible to observe which states would decide who becomes the next president of the United States, the share of negative tweets fell to its lowest level in the entire period. Could a reversal of sentiment be inferred from this moment?
You can definitely observe a 'thickening' of moods and of the frequency of shared posts. After November 3 the volume of tweets peaked, as the plot below shows.
fig1 = px.scatter(txtm_biden_sentiment, x="Timestamp",
                  y="VADER score",
                  hover_data=["VADER score"],
                  color_discrete_sequence=["lightseagreen", "indianred", "mediumpurple"],
                  color="predicted sentiment",
                  size_max=10,
                  title="Biden Tweets")
fig1
All data-preparation steps were very similar to Joe Biden's part: the data structure was inspected, and redundant columns and N/A values were removed.
data_trump = pd.read_csv('./Election data/hashtag_donaldtrump.csv', lineterminator='\n', parse_dates=True)
data_trump.head()
data_trump = data_trump.dropna()
data_trump = data_trump[['created_at', 'tweet']]
data_trump.rename(columns={'created_at': 'Timestamp', 'tweet': 'Text'}, inplace=True)
data_trump.to_csv("./Election data/data_trump.csv", index=None)
txtm_trump = pd.read_csv('./Election data/data_trump.csv')
txtm_trump['Text'] = txtm_trump['Text'].astype(str)
print(txtm_trump.shape)
txtm_trump.head()
%%time
compval2 = []
for i in range(len(txtm_trump)):
    k = analyser.polarity_scores(txtm_trump.iloc[i]['Text'])
    compval2.append(k['compound'])
compval2 = np.array(compval2)
# The result of the operation is added to our data set
txtm_trump['VADER score'] = compval2
len(compval2)
%%time
predicted_value = []
# the very same cutoff levels as for the Biden set
for i in range(len(txtm_trump)):
    score = txtm_trump.iloc[i]['VADER score']
    if score >= 0.7:
        predicted_value.append('positive')
    elif score > 0:
        predicted_value.append('neutral')
    else:  # score <= 0
        predicted_value.append('negative')
# Add predicted values to our data set and separate days from the 'Timestamp' column
txtm_trump['predicted sentiment'] = predicted_value
txtm_trump['Date'] = txtm_trump['Timestamp'].apply(lambda s: s.split()[0])
# We save our data set with evaluated scores for later
txtm_trump.to_csv("./Election data/data_trump_sent.csv", index=None)
txtm_trump_sent = pd.read_csv('./Election data/data_trump_sent.csv')
txtm_trump_sent.head()
txtm_trump_sent.groupby('predicted sentiment').size().plot(kind='bar')
We can observe that negative tweets also constitute the majority of tweets with the hashtag #Trump. The high share of negative tweets may result from the levels at which the VADER indicator is interpreted, but how do the two data sets relate to each other?
boxplot_df = txtm_biden_sentiment.groupby('predicted sentiment').size()
boxplot_df = pd.DataFrame(boxplot_df)
boxplot_df['Trump'] = txtm_trump_sent.groupby('predicted sentiment').size()
boxplot_df.rename(columns={0: 'Biden'}, inplace=True)
boxplot_df.plot(kind='bar',title='Sentimental Analysis for All Tweets', figsize=(10,7))
It is now more evident that, in terms of counts, more #Trump tweets were classified as negative and neutral, while the contribution of positive tweets is smaller than for #Biden.
txtm_trump_temp = txtm_trump_sent.groupby(['Date','predicted sentiment']).count().reset_index()
txtm_trump_temp = txtm_trump_temp[['Date', 'predicted sentiment', 'Timestamp']]
txtm_trump_temp.rename(columns={'Date': 'Date', 'predicted sentiment': 'Sentiment','Timestamp': 'Count'}, inplace=True)
txtm_trump_temp = np.array(txtm_trump_temp)
txtm_trump_temp = pd.DataFrame(txtm_trump_temp[3:77])
txtm_trump_temp.rename(columns={0: 'Date', 1: 'Sentiment',2: 'Count'}, inplace=True)
txtm_trump_temp.head(10)
# a cross-tabulated data structure is needed for this type of plot
p = np.array(txtm_trump_temp['Date'])
q = np.array(txtm_trump_temp['Sentiment'])
r = np.array(txtm_trump_temp['Count'])
crossed_data_trump = pd.crosstab(p, columns=q, values=r, aggfunc='sum', rownames=['Date'], colnames=['Sentiment'])
crossed_data_trump=crossed_data_trump.dropna()
crossed_data_trump['sum'] = crossed_data_trump['negative']+crossed_data_trump['neutral']+crossed_data_trump['positive']
crossed_data_trump['pos_per'] = crossed_data_trump['positive']/crossed_data_trump['sum']
crossed_data_trump['neg_per'] = crossed_data_trump['negative']/crossed_data_trump['sum']
crossed_data_trump['neu_per'] = crossed_data_trump['neutral']/crossed_data_trump['sum']
crossed_data_trump['date'] = crossed_data_trump.index
Basic indicators per day have been calculated and are shown below:
crossed_data_trump = np.array(crossed_data_trump)
txtm_trump_df = pd.DataFrame(crossed_data_trump)
txtm_trump_df.rename(columns={0: 'Negative', 1: 'Neutral',2: 'Positive',3: 'Sum', 4: 'Posper',5: 'Negper',6: 'Neuper',7:'Date'}, inplace=True)
txtm_trump_df.head(5)
df = txtm_trump_df[['Date','Posper','Negper','Neuper']]
ax = df.plot.area(x='Date', y=['Posper','Negper','Neuper'], colormap='winter', figsize=(10,7))
ax.axvline(x=20, color='r', linestyle='--', lw=3, label="6th November - 'Swing States' visible")
ax.axvline(x=17, color='y', linestyle='--', lw=3, label='3rd November - US Election Day')
ax.set_xlabel('')
ax.set_ylabel('Predicted Sentiment Share')
ax.legend(bbox_to_anchor=(1.0, 1), loc='upper left')
From the chart, you can observe some increase in negative tweets and, mostly, a decrease in neutral posts, perhaps as a result of signs of a potential loss for Joe Biden; there is no significant fall in positive tweets.
On the other hand, on November 6, when the 'Swing States' were visible, the share of negative tweets rose to the highest level in the entire period. Could a reversal of sentiment be inferred from this moment?
We observe the complete opposite of the reaction observed with #Biden.
Let's check which words most often appeared in posted tweets, depending on whether they contained #Biden or #Trump. After the first analysis, we expect to see more words associated with negativity on the #Trump side of the axis. The words on the #Biden side should be more emotionally neutral.
import pandas as pd
import numpy as np
%matplotlib inline
import spacy
import en_core_web_sm
import nltk
#nltk.download('punkt')
import nltk.corpus
nlp = en_core_web_sm.load()
import scattertext as st
import re, io
import os, pkgutil, json, urllib
from urllib.request import urlopen
from IPython.display import IFrame
from IPython.core.display import display, HTML
from scattertext import CorpusFromPandas, produce_scattertext_explorer
display(HTML("<style>.container { width:98% !important; }</style>"))
from nltk.corpus import stopwords
#nltk.download('stopwords')
from nltk.tokenize import word_tokenize
from nltk.corpus.reader.plaintext import PlaintextCorpusReader
from nltk.stem import PorterStemmer
from nltk.tokenize.treebank import TreebankWordDetokenizer
from pprint import pprint
from scipy.stats import rankdata, hmean, norm
def filterStop_words(words):
    """Remove English and German stopwords from a list of tokens."""
    stop_words = set(stopwords.words(['english', 'german']))
    filtered_sentence = []
    for w in words:
        if w not in stop_words:
            filtered_sentence.append(w)
    return filtered_sentence
txtm_trump = pd.read_csv('./Election data/data_trump.csv')
print(txtm_trump.shape)
#txtm_trump['Text'] = txtm_trump['Text'].astype(str)
txtm_trump.head()
txtm_biden = pd.read_csv('./Election data/data_joebiden.csv')
print(txtm_biden.shape)
txtm_biden.head()
# sample to shorten execution time and assure equal subpopulations
trump_plot_data = txtm_trump.sample(30000)
trump_plot_data['Candidate'] = "Trump"
trump_plot_data = trump_plot_data[['Text','Candidate']]
biden_plot_data = txtm_biden.sample(30000)
biden_plot_data['Candidate'] = "JoeBiden"
biden_plot_data = biden_plot_data[['Text','Candidate']]
plot_data_union = pd.concat([trump_plot_data, biden_plot_data])
plot_data_union
# Regex cleaning (re was already imported above)
plot_data_union['Text'] = plot_data_union['Text'].astype(str)
plot_data_union['test'] = plot_data_union.apply(
lambda row: re.sub(r'[^a-zA-Z ]', '',row['Text']),
axis=1
)
plot_data_union['test1'] = plot_data_union.apply(
lambda row: re.sub(r'.rump', '',row['test']),
axis=1
)
plot_data_union['test2'] = plot_data_union.apply(
lambda row: re.sub(r'.iden', '',row['test1']),
axis=1
)
plot_data_union['test3'] = plot_data_union.apply(
lambda row: re.sub(r'.oe', '',row['test2']),
axis=1
)
plot_data_union['Clean'] = plot_data_union.apply(
lambda row: re.sub(r'.onald', '',row['test3']),
axis=1
)
plot_data_union = plot_data_union.drop(['test', 'test1', 'test2', 'test3'], axis = 1)
plot_data_union
plot_data_union['Parsed'] = plot_data_union.Clean.apply(word_tokenize)
plot_data_union['Filtered'] = plot_data_union.Parsed.apply(filterStop_words)
plot_data_union['deTokenized'] = plot_data_union.Filtered.apply(TreebankWordDetokenizer().detokenize)
plot_data_union = plot_data_union.drop(['Clean', 'Parsed', 'Filtered'], axis = 1)
plot_data_union
plot_data_union.to_csv("./Election data/plot_data_union.csv", index=None)
plot_data_temp = pd.read_csv("./Election data/plot_data_union.csv")
plot_data_temp['deTokenized'] = plot_data_temp['deTokenized'].astype(str)
plot_data_temp['Final'] = plot_data_temp.deTokenized.apply(nlp)
plot_data_temp
corpus = st.CorpusFromParsedDocuments(plot_data_temp, category_col='Candidate', parsed_col='Final').build()
# Visualize the chart
html = produce_scattertext_explorer(corpus,
category='Trump',
category_name='D.Trump',
not_category_name='J.Biden',
width_in_pixels=1000,
minimum_term_frequency=50,
transform=st.Scalers.log_scale_standardize)
file_name = 'Election2020ScattertextScale.html'
open(file_name, 'wb').write(html.encode('utf-8'))
IFrame(src=file_name, width = 1200, height=700)
Scattertext is a visualization of how language differs among document types. The data has been processed in such a way as to allow you to inspect the most frequently used words depending on whether they belong to #Trump or #Biden.
Based on the sentiment analysis using the VADER method, each tweet could be assigned an estimated sentiment: neutral, positive, or negative. The levels selected by the Author suggest a strong predominance of tweets predicted as negative, but this may result, among other things, from the high cutoff fixed for the positive class. As for the higher ratio of negative tweets to the total for #Trump compared with the same ratio for #Biden, the Author leaves this open to interpretation.
When the timeline is superimposed on the sentiment analysis, the visual analysis suggests that external factors may influence the overall share of negative tweets. Superimposing the dates of the United States presidential election and of the disclosure of the 'Swing States' is only a visual device of the Author; it is not claimed that these events have an unequivocal influence on the studied variable.
Changes in social relations and in the aggressiveness of communication over time may highlight the use of political narratives along the timeline of the presidential election.
Bearing in mind the works of other authors, according to which public opinion can be directed and is guided by the behavior and opinions of politicians, one can try to show that the moods of the voters of individual parties are analogous to, or correlated with, the moods of the parties themselves.